How to keep track of filenames for field source? #365

ThibHlln · 2022-03-28T14:06:12Z

Hi Sadie, Hi David, 🙂

In unifhy we have been using cf.Field.get_filenames for a while to track down the source files of the user input fields so that they can be stored in a configuration file for potential later reuse. I have only recently been faced with the scenario where "If all of the data are in memory then an empty set is returned", meaning that get_filenames no longer returns the information we are looking for (unifhy-org/unifhy#80).

So, I am wondering:

- is there another attribute/property/method of cf.Field that always keeps the filenames of a field regardless of whether its data fits in memory?
- if not, would it make sense for cf-python to provide such functionality? e.g. not to drop the filenames even if the data fits in memory (I am guessing it wouldn't, otherwise you would already have implemented it 🙂)

Thank you in advance for your help,
Thibault


davidhassell · 2022-03-30T15:12:47Z

Hi Thibault - good to hear from you. As you've guessed, it's complicated!

The short answer to your particular problem is perhaps to manually save the file names straight after the read step:

>>> import cf
>>> cf.write(cf.example_field(0), '~/delme.nc')
>>> f = cf.read('~/delme.nc')[0]
>>> f._custom['saved_filenames'] = f.get_filenames()
>>> f._custom['saved_filenames']
{'/home/david/delme.nc'}

Doing it this way, by adding it to the _custom dictionary as opposed to just setting the non-reserved attribute f.saved_filenames = ..., ensures that they'll get copied if you do g = f.copy().

I can't think of a reason why we couldn't formalise this to, say:

>>> f.get_filenames(save=True)
>>> f.saved_filenames  # now a reserved attribute
'/home/david/delme.nc'

This method with save=True would save the output of get_filenames (which could be an empty set) regardless of whether or not names had previously been saved. Would that be useful?
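In the meantime, the manual step can be wrapped in a small helper with the same "always overwrite" behaviour. This is only a sketch: `save_filenames` and its `key` argument are hypothetical names, not part of cf-python, and `FakeField` is a stand-in so that the example runs without cf-python installed. It assumes only that the field has a `get_filenames()` method and a `_custom` dictionary, as in the session above.

```python
def save_filenames(field, key="saved_filenames"):
    """Store field.get_filenames() in field._custom[key] and return it.

    Hypothetical helper (not a cf-python API). Any previously saved
    value is overwritten, even when the new result is an empty set.
    """
    filenames = set(field.get_filenames())
    field._custom[key] = filenames
    return filenames


class FakeField:
    """Stand-in for cf.Field, used only to make this sketch runnable."""

    def __init__(self, filenames):
        self._filenames = set(filenames)
        self._custom = {}

    def get_filenames(self):
        return set(self._filenames)


f = FakeField({"/home/david/delme.nc"})
save_filenames(f)
print(f._custom["saved_filenames"])
```

With a real field, the call would be made straight after `cf.read`, before any operation that could bring the data into memory.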

So what are the complications? As usual, it's ambiguities and corner cases. If array values have been entirely overwritten (f += 1), then the presence of saved filenames could be misleading to some people, but not to others. The same applies if only some of the contributing files have been made "redundant". Note also that the filenames include the names of files which contain coordinates and other metadata - these will usually, but not always, be in the same files as the data ...

A final note, which might make all this moot (at least for you!), is that soon we'll be releasing the first dask version of cf-python. Because the dask data stores up operations lazily, the original filenames will still be present and returnable by get_filenames, even if you did f += 1. Of course, you can still lose this information by forcing the operations to be computed internally (cf. dask.array.Array.persist()), but it could open up more possibilities.

Anyway, let us know if you'd like f.get_filenames(save=True) implemented, and we'll get right on it - it will be a very quick implementation.

All the best,
David

ThibHlln · 2022-03-31T13:46:36Z

Hi David,

Thank you for your detailed reply, as always. 🙂

I think the manual option you suggest is perfectly acceptable. I agree with you that "saved filenames" could be misleading to some if the field has been altered in such a way that it is no longer the same as the one in the file. So it is probably best not to implement it, although it is not up to me to decide!

Thank you for your help!
Take care,
Thibault

davidhassell · 2022-11-15T10:36:26Z

Closing now, since at 3.14.0 we will have both original filenames (#448) and "live" filenames (#498).
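For code that has to run against versions on both sides of 3.14.0, one option is to prefer the "original filenames" when they are available and fall back to get_filenames otherwise. The sketch below assumes that the method added by #448 is exposed as `get_original_filenames()`; please check the documentation of the installed version. `source_filenames`, `OldField` and `NewField` are hypothetical names used only for illustration.

```python
def source_filenames(field):
    """Return the best available record of a field's source files.

    Hypothetical helper. Assumes the "original filenames" method is
    named get_original_filenames(); falls back to get_filenames(),
    which may be empty once all of the data are in memory.
    """
    getter = getattr(field, "get_original_filenames", None)
    if callable(getter):
        return set(getter())
    return set(field.get_filenames())


class OldField:
    """Stand-in for a field from a version without original filenames."""

    def get_filenames(self):
        return set()  # data already in memory


class NewField(OldField):
    """Stand-in for a field that also records its original filenames."""

    def get_original_filenames(self):
        return {"/home/david/delme.nc"}


print(source_filenames(OldField()))
print(source_filenames(NewField()))
```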

ThibHlln added the question General question label Mar 28, 2022

ThibHlln changed the title ~~How keep track of filename for field source?~~ How to keep track of filenames for field source? Mar 28, 2022

ThibHlln mentioned this issue Mar 31, 2022

filename not saved in YAML configuration file unifhy-org/unifhy#80

Closed

davidhassell mentioned this issue Jun 13, 2022

dask: Dask.get_filenames (2) #408

Merged

This was referenced Sep 9, 2022

New "original filenames" methods NCAS-CMS/cfdm#215

Closed

New "original filenames" methods #448

Closed

davidhassell closed this as completed Nov 15, 2022

davidhassell added the netCDF read Relating to reading netCDF datasets label Nov 15, 2023
