Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iris's load and load_cubes utilities return a CubeList with its elements in random order #3314

Closed
evertrol opened this issue May 15, 2019 · 10 comments

Comments

@evertrol
Copy link

When loading a dataset that has multiple cubes, load() and load_cubes() will return a CubeList with its elements in random order. This makes it harder for following code to be reproducible: indexing into the list isn't guaranteed to always provide the same cube.

Perhaps there is a good reason CubeList is inherently unordered, but since list (from which it derives) is inherently ordered, I feel it's better to have it be ordered, so that there are no surprises.


The cause seems to in the iris.fileformats.netcdf.load_cubes function, where it calls the .values() method of the various CFGroup objects.
One solution could be (untested) to make CFGroup._cf_variables a collections.OrderedDict (and perhaps the other attributes of CFGroup as well).

Potentially, this may also apply for various other dicts, or even the MutableMapping base class of CFGroup.

@DPeterK
Copy link
Member

DPeterK commented May 16, 2019

I'd assert this is a feature not a bug – CubeLists are, and should be expected to be, unordered (think pre-Python 3.6 dictionary keys). If you wish to reproducibly select a cube from the CubeList you should use a Constraint that's defined by metadata of interest and extract from the CubeList using that Constraint. This is a generally more robust approach than indexing.

@evertrol
Copy link
Author

Ah, right. I got thrown off by its name and parent class (list), and of course methods like append and insert are inherently ordered. I guess it's more a multi-set type of class.

Thanks, I'll use the constraint route.

@rcomer
Copy link
Member

rcomer commented May 29, 2019

Hmmm, I think once loaded cubelists are ordered aren't they? If I do:

new_cubelist = iris.cube.Cube([cube.collapsed('foo', iris.analysis.MEAN) for cube in old_cubelist])

then I can expect the order in my new cubelist to be consistent with the old.

However, I think I'd always be nervous about assuming a particular order in a freshly loaded cube, regardless of how iris.load is written. What happens if another process changed your NetCDF file on disk since you last used it?

@bjlittle
Copy link
Member

bjlittle commented Jun 2, 2019

Thanks @evertrol for this issue. Much appreciated! 👍

We've never intended the Cubes within a CubeList to in any specific order. This was a design choice from outset of iris, and we'd be hesitant to enforce an ordering - simply because there is no obvious way to do it in a way that everyone would agree with.

So the deal is that you can't rely on any particular ordering - if you consider it random, then that's a pretty safe assumption to make. We've left the ordering down to the user to decide, if that matters to them, post loading. As @DPeterK mentioned, use a constraint to refine and extract the cubes that you care about, rather than indexing - indexing is a very weak contract, there is no guarantee that the same cube will be at the same index when comparing independent loads of the same file/s.

@evertrol If you don't mind, I'm going to close this issue. However, if you feel that you want to re-open it as a means to force an open discussion on Cube ordering within a CubeList, then please do so, and the community will take it from there 😄

@bjlittle bjlittle self-assigned this Jun 2, 2019
@bjlittle bjlittle closed this as completed Jun 2, 2019
@evertrol
Copy link
Author

Not a problem having this issue closed. It's mostly arguing semantics, and I wouldn't expect to see an API change/rename. As DPeterK remarks, it's mostly a dict keys style (or dict values, if items can be repeated).
Is it worth to put a remark in the documentation about it, either (or both) at the main reference documentation page or the load_cubes() reference documentation?

@bjlittle
Copy link
Member

@evertrol Thanks.

That seems like a happy compromise to me, so I'm going to re-open this issue with the task of updating the documentation and/or doc-strings to highlight the orders of cubes returned within a CubeList so that it isn't a surprise to users.

@lbdreyer
Copy link
Member

This has been addressed in @stephenworsley 's PR: #3370 so I'm going to close this issue

@edmundhenley-mo
Copy link

edmundhenley-mo commented Apr 12, 2021

Hiya all - see this issue is closed (and the reasoning given above). See also that the docs are nicely explicit about lack of inherent order.

However... is there any option/feeling re option of adding a keyword to iris.load such that a user can get a sorted cubelist out if iris.load ends up returning a cube list?

Straw API
cube_list = iris.load(my_file.nc, sort_order=<something>)
where <something> could be (say) one of following:

  • a set of common keywords (e.g. ["alphabetical", ...])
  • an option for (say) a user-defined function to determine this (with docs giving some sample functions implementing say alphabetical based on cube.name, reverse alphabetical, field size in bytes, ...)
  • other?

Whichever of these is used, there could be a default sort_order=None (or similar) default to preserve the current unsorted behaviour as default.

Use-case for this:
I'm trying to use a notebook to document development of scripts to wrangle some ~nasty data which results in a cube list.

Notebooks of course not natively the best for diffing generally (thanks to several of you I know there are lots of tools to help with this!).

But in this case the cubelist being unordered adds needless extra diff line noise I could do without, and quite a lot of it as I'm using html-formatted output.

Not a major issue - but enough of a niggle that I thought worth poking this issue.

Cons
Realise there may be a -1 (more?) aspect in that anything like this may encourage people to index in to the sorted cubelist. Leading to opaque code etc. I hereby solemnly swear never to!.

Moot, but FWIW I do agree as with original poster that this is a bit of a gotcha given list in CubeList name, especially as you can even ~rely on dicts' sort order now!

Overall
IMO would be nice if there were a ~easy option to allow sorting for use-cases like this. Perhaps by making the default unsorted it wouldn't become something people would rely on too much (indexing etc), but could be opted into by those who need it.

@jonseddon
Copy link
Contributor

Could you sort your CubeList yourself after loading? For example:

cubes.sort(key=lambda cube: cube.var_name)

@edmundhenley-mo
Copy link

Definitely! Perfect, thanks Jon! For my use-case this makes my request ignorable, as I can sort after loading but before printing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants