What makes the NAME loader faster than the NetCDF loader? #5053

Closed

trexfeathers opened this issue Nov 7, 2022 · 3 comments

Comments

@trexfeathers
Contributor

📰 Custom Issue

@bsherratt has provided an interesting case. A NAME III .txt file describing:

  • 200 x 200 grid
  • ~200 phenomena

The file takes ~20s to load. If Iris is used to convert the file to NetCDF, that NetCDF file takes ~105s to load. Both loaders are unreasonably slow in this example, since the grid is shared between all phenomena, yet the Coord instances are created on a Cube-by-Cube basis (see #3172 for discussion on sharing). But why is NetCDF even slower?
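
For reference, a minimal sketch of that comparison, with hypothetical filenames standing in for the real file:

# Minimal timing sketch (hypothetical filenames): load the NAME III .txt file,
# convert it to NetCDF with Iris, then time loading the NetCDF copy.
import time

import iris

name_path = "name_output.txt"  # hypothetical NAME III file
nc_path = "name_output.nc"     # hypothetical converted copy

start = time.perf_counter()
cubes = iris.load(name_path)
print(f"NAME load:   {time.perf_counter() - start:.1f}s")  # ~20s reported

iris.save(cubes, nc_path)

start = time.perf_counter()
iris.load(nc_path)
print(f"NetCDF load: {time.perf_counter() - start:.1f}s")  # ~105s reported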

File available on request.

@vsherratt
Contributor

vsherratt commented Nov 8, 2022

Just to update...

We had a theory that "breaking" the coordinate metadata could speed this up, and it really did.

Pre-processing: rename each variable's coordinates attribute to something that Iris does not recognise. Iris will therefore not attempt to create any aux coords on load (so they show up as cubes instead).

netCDF4 example

import netCDF4 as nc

with nc.Dataset(filename, "a") as file:
    for var in file.variables.values():
        if "coordinates" in var.ncattrs():
            var.renameAttribute("coordinates", "fake_coordinates")
    file.sync()

Post-processing: identify the "cubes" that were supposed to be aux coords, and add them to the relevant cubes. This can be done by overriding the usual netcdf loaders in iris.fileformats.FORMAT_AGENT with a simple wrapper around iris.fileformats.netcdf.load_cubes:

Incomplete example

from iris.coords import AuxCoord
from iris.fileformats import netcdf


def load_cubes(filenames, callback=None, constraints=None):
    # Identify coords: anything with a fake_coordinates attribute is a real
    # cube, and the names it lists are the variables that should become coords.
    cubes = []
    unknown = []
    coord_names = set()
    for cube in netcdf.load_cubes(filenames, callback, constraints):
        fake_coords = cube.attributes.get("fake_coordinates", "").split()
        if fake_coords:
            # Definitely a cube
            cubes.append(cube)
            coord_names.update(fake_coords)
        else:
            # Might be a cube or a coord
            unknown.append(cube)

    # Create the coords
    coords = {}
    for cube in unknown:
        if cube.name() in coord_names:
            coord = AuxCoord(cube.data)
            coord.metadata = cube.metadata
            coords[coord.var_name] = coord
        else:
            cubes.append(cube)

    # Add coords to cubes
    for cube in cubes:
        fake_coords = cube.attributes.pop("fake_coordinates", "").split()
        for name in fake_coords:
            # TODO: Assumes scalar...
            cube.add_aux_coord(coords[name])

        yield cube

This doesn't cover everything, such as bounds or non-scalar aux coords, but it did bring the load time down to ~3 seconds. Re-adding the coordinates is a tiny proportion of that, so I don't imagine a full solution would take much longer either.
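
For completeness, hooking the wrapper into loading via iris.fileformats.FORMAT_AGENT might look roughly like the sketch below. This is not from the issue: the spec name, the priority value and the classic-NetCDF magic number are assumptions modelled on how Iris registers its built-in NetCDF handler.

# Hypothetical registration sketch (not from the issue): add a spec with a
# higher priority than the built-in NetCDF one, so iris.load() routes classic
# NetCDF files ("CDF\x01" magic number) through the wrapper above.
import iris.fileformats
from iris.io.format_picker import FormatSpecification, MagicNumber

iris.fileformats.FORMAT_AGENT.add_spec(
    FormatSpecification(
        "NetCDF (fake_coordinates workaround)",  # hypothetical spec name
        MagicNumber(4),
        0x43444601,
        load_cubes,  # the wrapper defined above
        priority=6,  # assumed to outrank the built-in NetCDF spec
        constraint_aware_handler=True,
    )
)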

However... in all .nc cases the cube data is lazy, while the NAME loader does not create lazy data. Realising the data takes about 50s, so netCDF is still slower overall.
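
To make that concrete, a small sketch (hypothetical filename) of checking for lazy data and timing its realisation:

# Minimal sketch (hypothetical filename): NetCDF-loaded cubes are lazy, so the
# remaining cost only appears when the data is realised.
import time

import iris

cubes = iris.load("name_output.nc")
print(all(cube.has_lazy_data() for cube in cubes))  # True for the NetCDF loader

start = time.perf_counter()
for cube in cubes:
    _ = cube.data  # realise the lazy array
print(f"Realisation: {time.perf_counter() - start:.1f}s")  # ~50s reported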

@pp-mo
Member

pp-mo commented Nov 15, 2022

#5069 shows a really basic test where we bypass lazy array creation in _get_cf_var_data if the array is "small".
This really speeds up the test case for NetCDF load, from ~150s to ~4s.

Like this (an excerpt; the else branch is the existing lazy-loading code, unchanged):

_NONLAZY_SMALL_AUXCOORDS = True


def _get_cf_var_data(cf_var, filename):
    if _NONLAZY_SMALL_AUXCOORDS and np.prod(cf_var.shape) < 20:
        # Small variable: read the real data eagerly instead of building a lazy array.
        result = cf_var[:]
    else:
        # Get lazy chunked data out of a cf variable.
        ...

Of course, #3172 would be a solution, but sadly that still looks a long way off.

This sounds like a possible win as-is? It doesn't seem like a breaking change.

How do we choose a size threshold?
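
One hypothetical way to reason about a threshold (not from the issue): the memory cost of reading a variable eagerly is bounded by its element count times the item size, so even a threshold well above 20 elements is tiny per variable.

# Hypothetical back-of-envelope helper (not from the issue): an upper bound on
# the bytes realised eagerly for one variable of a given shape.
import numpy as np


def eager_cost_bytes(shape, itemsize=8):
    """Bytes read eagerly if a variable of `shape` is loaded non-lazily."""
    return int(np.prod(shape)) * itemsize


print(eager_cost_bytes((20,)))       # 160 bytes at the sketch's threshold of 20
print(eager_cost_bytes((200, 200)))  # 320000 bytes for the full 200 x 200 grid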

@trexfeathers
Contributor Author

Closed by #5229
