How to keep track of filenames for field source? #365

Closed
ThibHlln opened this issue Mar 28, 2022 · 3 comments
Labels
netCDF read Relating to reading netCDF datasets question General question

Comments

@ThibHlln

Hi Sadie, Hi David, 🙂

In unifhy we have been using cf.Field.get_filenames for a while to track down the source files of the user input fields so that they can be stored in a configuration file for potential later reuse. I have only recently been faced with the scenario where "If all of the data are in memory then an empty set is returned", meaning that get_filenames no longer returns the information we are looking for (unifhy-org/unifhy#80).

So, I am wondering:

  • is there another attribute/property/method of cf.Field that always keeps the filenames of a field regardless of whether its data fits in memory?
  • if not, would it make sense for cf-python to provide such functionality? e.g. not to drop the filenames even if the data fits in memory (I am guessing it wouldn't, otherwise you would already have implemented it 🙂)

Thank you in advance for your help,
Thibault

@ThibHlln ThibHlln added the question General question label Mar 28, 2022
@ThibHlln ThibHlln changed the title from "How keep track of filename for field source?" to "How to keep track of filenames for field source?" Mar 28, 2022
@davidhassell
Collaborator

Hi Thibault - good to hear from you. As you've guessed, it's complicated!

The short answer to your particular problem is perhaps to manually save the file names straight after the read step:

>>> import cf
>>> cf.write(cf.example_field(0), '~/delme.nc')
>>> f = cf.read('~/delme.nc')[0]
>>> f._custom['saved_filenames'] = f.get_filenames()
>>> f._custom['saved_filenames']
{'/home/david/delme.nc'}

Doing it this way, by adding it to the _custom dictionary as opposed to just setting the non-reserved attribute f.saved_filenames = ..., ensures that they'll get copied if you do g = f.copy().
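A minimal sketch of that manual step wrapped into a helper, for illustration only. `remember_filenames` and the `FakeField` stand-in are hypothetical names, not part of cf-python; the stand-in merely mimics the two things the helper relies on (a `get_filenames()` method and a `_custom` dictionary that survives `copy()`) so that the snippet runs without cf installed.

```python
import copy


def remember_filenames(field, key="saved_filenames"):
    """Store the field's current file names under field._custom[key].

    Hypothetical helper: call it straight after the read step, before
    the data have been brought into memory or modified.
    """
    field._custom[key] = set(field.get_filenames())
    return field._custom[key]


class FakeField:
    """Stand-in for cf.Field (an assumption, for demonstration only)."""

    def __init__(self, filenames):
        self._filenames = set(filenames)
        self._custom = {}

    def get_filenames(self):
        return set(self._filenames)

    def load_into_memory(self):
        # Mimics the case where get_filenames() returns an empty set
        self._filenames = set()

    def copy(self):
        new = FakeField(self._filenames)
        new._custom = copy.deepcopy(self._custom)
        return new


f = FakeField({"/home/david/delme.nc"})
remember_filenames(f)
f.load_into_memory()
g = f.copy()

print(f.get_filenames())             # set()
print(g._custom["saved_filenames"])  # {'/home/david/delme.nc'}
```

With a real field, the same helper would presumably be called as `remember_filenames(cf.read(path)[0])`, but that has not been checked here.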

I can't think of a reason why we couldn't formalise this to, say:

>>> f.get_filenames(save=True)
>>> f.saved_filenames  # now a reserved attribute
'/home/david/delme.nc'

This method with save=True would save the output of get_filenames (which could be an empty set) regardless of whether or not names had previously been saved. Would that be useful?

So what are the complications? As usual, it's ambiguities and corner cases. If array values have been entirely overwritten (f += 1), then the presence of saved filenames could be misleading to some people, but not others. The same applies if only some of the contributing files have been made "redundant". Note also that the file names include those of files which contain coordinates and other metadata - these will usually, but not always, be the same files as the data.

A final note, which might make all this moot (at least for you!), is that soon we'll be releasing the first dask version of cf-python. Because the dask data stores up operations lazily, the original filenames will still be present and returnable by get_filenames, even if you did f += 1. Of course, you can still lose this information by forcing the operations to be computed internally (cf. dask.array.Array.persist()), but it could open up more possibilities.

Anyway, let us know if you'd like f.get_filenames(save=True) implemented, and we'll get right on it - it will be a very quick implementation.

All the best,
David

@ThibHlln
Author

Hi David,

Thank you for your detailed reply, as always. 🙂

I think the manual option you suggest is perfectly acceptable. I agree with you that "saved filenames" could be misleading to some if the field has been altered in such a way that it is no longer the same as the one in the file anymore. So it is probably best not to implement it, although it is not up to me to decide!
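For the configuration-file use case mentioned at the start of this thread, the saved set could be written out along these lines. This is a sketch only: the JSON layout, the `source_files` key and both function names are assumptions for illustration, not unifhy's actual configuration format.

```python
import json


def filenames_to_config(filenames):
    """Serialise a set of file names as JSON text.

    Sets are not a JSON type, so the names are stored as a sorted list,
    which also keeps the output stable between runs.
    """
    return json.dumps({"source_files": sorted(filenames)}, indent=2)


def filenames_from_config(text):
    """Recover the set of file names from the JSON text."""
    return set(json.loads(text)["source_files"])


saved = {"/data/b.nc", "/data/a.nc"}  # e.g. the saved filenames of a field
text = filenames_to_config(saved)
restored = filenames_from_config(text)
```

An empty set round-trips as an empty list, so a field whose names were never captured remains distinguishable from one read from a known file.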

Thank you for your help!
Take care,
Thibault

@davidhassell
Collaborator

Closing now, since at 3.14.0 we will have both original filenames (#448) and "live" filenames (#498).

@davidhassell davidhassell added the netCDF read Relating to reading netCDF datasets label Nov 15, 2023