Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent variable naming #30

Merged
merged 10 commits into from
Aug 21, 2020

Conversation

leifdenby
Copy link
Collaborator

This pull-request adds functionality to address #26

@leifdenby
Copy link
Collaborator Author

There is one potential issue with the current implementation: If there there exists in a dataset a variable with the "correct" variable (for example T) for temperature that will always be picked, even if there are multiple variables in the dataset which have the air_temperature standard_name attribute. Are we ok with this?

Also, when I create a combined DataArray (so we can analyse/plot multiple variables at once which have for example air_temperature standard_name) I don't currently check the units. I was going to just check that the "units" string is the same for all variables, is that ok with everyone? I don't feel like this would be used alot, but sometimes you might want to make profile plots with multiple sources for temperature say...

@leifdenby leifdenby mentioned this pull request Aug 11, 2020
@leifdenby
Copy link
Collaborator Author

@annalea-albright and @JuleRadtke it would be great to have your thoughts on this too, particularly: https://github.com/eurec4a/eurec4a-environment/pull/30/files#diff-ba464340b171c9b9665c96a45aa313caR10

Does this make sense to you?

@leosaffin
Copy link
Collaborator

This looks good. It seems to implement the initial steps of the calculate function I was working on but with better error messages/handling and the nice feature of concatenating multiple variables with the same standard name. I would get this merged in soon to get into the habit of using this for new functions. There's a couple of things I would change before the merge:

  1. I think the function get_field is generic enough that you should put it into the top-level init.py file. It would be good to be able to do from eurec4a_environment import get_field
  2. It would be good to put in an option to suppress the concatenating of multiple fields with a default of False, so when you get to the the len(matching_arrays) being more than 1 section you would just raise an error message. This way, if you know you're in a function that requires only one variable returned you can set that flag.

After merge I would still like to get the calculate function involved here. It could then work by initially calling get_field with a try/except on FieldMissingException.

TEMPERATURE: "air_temperature",
ALTITUDE: "geopotential_height",
RELATIVE_HUMIDITY: "relative_humidity",
POTENTIAL_TEMPERATURE: "potential_temperature",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CF standard is actually "air_potential_temperature". I think this is an inconsistency with the JOANNE dataset

Copy link
Collaborator

@Geet-George Geet-George Aug 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the correction. I'll update it for the next JOANNE version.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this @LSaffin. I'll check this and make sure we use the next JOANNE version as soon as it's ready. Maybe you could give me a ping on mattermost @Geet-George and then I'll update the intake catalog?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll update you on it. 👍

Copy link
Collaborator

@annalea-albright annalea-albright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is wonderful!

Copy link
Collaborator

@Geet-George Geet-George left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great implementation, @leifdenby :)

import xarray as xr


TEMPERATURE = "T"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TEMPERATURE = "T"
TEMPERATURE = "ta"

This is just to be consistent with the latest variable names in JOANNE - most variable names are finalised, but some are still being discussed.. :/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem :)



TEMPERATURE = "T"
ALTITUDE = "height"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ALTITUDE = "height"
ALTITUDE = "alt"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll update these. Should I update the JOANNE dataset in the intake catalog already?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did these get updated?

var_dataarrays = list(matching_dataarrays.values())

# TODO: should we check units here too?
da_combined = xr.concat(var_dataarrays, dim="var_name")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is well thought of. Good implementation! However, all functions would have to be written to deal with this, i.e. multiple variables. e.g. estimating theta would require an array each of pressure and temperature. If get_field returns two temperature arrays and one pressure array, the function should know what to do... Ideally, return two theta arrays. Did I understand that correctly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 😄 Yes, the idea would be that what is returned is always temperature for example, but it might be multiple temperature profiles. I've elaborated a bit on this idea in my comment below. Hope that makes sense...

@leifdenby
Copy link
Collaborator Author

leifdenby commented Aug 12, 2020

Thanks for the feedback 😄

1. I think the function `get_field` is generic enough that you should put it into the top-level **init**.py file. 
   It would be good to be able to do `from eurec4a_environment import get_field`

Good idea @LSaffin . I'll do that

2. It would be good to put in an option to suppress the concatenating of multiple fields with a default of `False`, so 
   when you get to the the `len(matching_arrays)` being more than 1 section you would just raise an error message. 
   This way, if you know you're in a function that requires only one variable returned you can set that flag.

The way I am thinking about it is that all the operations work on vertical columns, i.e. they're a reduction bit like ds.mean(dim="altitude") and so they should be able to take DataArrays with any number of dimensions and just reduce the "height" dimension.

The extra dimensions could be for example time or sonde_number as we've had before, but they could also just be var_name (representing a number of DataArrays that were previously separate, but have now been merged into a single DataArray with a new var_name dimension). We could simply say that we don't want to merge when they are multiple DataArrays with the matching standard_name, but I do think that the functions we implement have to work independently of the other dimensions. This will also mean that with for example LES data (which could have the dimensions (x, y, z)) we could apply the functions by just setting the correct standard name on z. What do you think @LSaffin? This relates to your question too @Geet-George

@leifdenby
Copy link
Collaborator Author

leifdenby commented Aug 12, 2020

After merge I would still like to get the calculate function involved here. It could then work by initially calling get_field with a try/except on FieldMissingException.

This sounds great @LSaffin! Yes, this very much does half the work. I was thinking we'd have a dictionary somewhere (not sure where yet, do you have an idea?) with a mapping like so:

DERIVED_FIELDS = dict(
    z_LCL="variables.boundary_layer.lcl.find_LCL_Bolton",
    ...
)

Together with the code you've written this would provide the lookup for the module and function name that is needed to calculate a specific variable. We can then update this dictionary over time to provide more variables that could be used in for example the plotting functions

eurec4a_environment.plot.profile.profile_plot_1d(ds, variable="qt", height_levels=["z_LCL",...])

I'm realising we could even get rid of the z_ part if we're passing in the argument as height_levels, but I'm maybe getting a bit ahead of myself here...

@leosaffin
Copy link
Collaborator

The way I am thinking about it is that all the operations work on vertical columns, i.e. they're a reduction bit like ds.mean(dim="altitude") and so they should be able to take DataArrays with any number of dimensions and just reduce the "height" dimension.

The extra dimensions could be for example time or sonde_number as we've had before, but they could also just be var_name (representing a number of DataArrays that were previously separate, but have now been merged into a single DataArray with a new var_name dimension). We could simply say that we don't want to merge when they are multiple DataArrays with the matching standard_name, but I do think that the functions we implement have to work independently of the other dimensions. This will also mean that with for example LES data (which could have the dimensions (x, y, z)) we could apply the functions by just setting the correct standard name on z. What do you think @LSaffin? This relates to your question too @Geet-George

I see what you mean, this will be good for any functions calculating variables, they can and should try to be this flexible. I'm not sure how you would want to do it for plotting functions though. A few options I can think of:

  1. Allow them to pass a flag that raises an error in the get_field function
  2. You could also make the plot for every extra element in the new dimension
  3. Leave as is and let the matplotlib function do the error checking

I guess 2 sounds best but may be a bit excessive for more detailed plotting functions. Maybe don't hold up the pull request on this but keep it as an open issue.

@leosaffin
Copy link
Collaborator

leosaffin commented Aug 12, 2020

I was thinking we'd have a dictionary somewhere (not sure where yet, do you have an idea?) with a mapping like so:

DERIVED_FIELDS = dict(
    z_LCL="variables.boundary_layer.lcl.find_LCL_Bolton",

I would put the dictionary in the same place as the calculating function. It can be useful to import the dictionary and see the list of variables that can be calculated.

from eurec4a_environment.NAME_OF_MODULE import NAME_OF_FUNCTION, DERIVED_FIELDS
print(DERIVED_FIELDS.keys())

Names to be decided. I don't think my previous naming is optimal.

I'm realising we could even get rid of the z_ part if we're passing in the argument as height_levels, but I'm maybe getting a bit ahead of myself here...

Probably yes. We can look back into this once this pull request is resolved.

Copy link
Collaborator

@JuleRadtke JuleRadtke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, very nice!!

@leifdenby
Copy link
Collaborator Author

Thanks for the feedback again all. I'll work in the suggestions you made and the merge this 🚀

A common pattern when accessing a field is to then convert this field to
be in the required units. This commit adds functionality to use
`cf_units` to carry out this conversion automatically when the user
requests a valid set of units.
All test and existing variable calculations changed to use module-wide
nomenclature
It appears that `cfunits` is easier to install via pip (it doesn't break
on the CI system), so we use that instead.
@leifdenby
Copy link
Collaborator Author

In implementing this I realised that a common pattern in our functions it to ensure the fields we are operating on are in the correct units and so I added functionality to use the cf_units module to convert a field to the desired units. This makes it possible to do for example the following:

da_temperature = nom.get_field(ds=ds, field_name=temperature, units="K")

instead of

if ds[temperature].units == "C":
    da_temperature = ds[temperature] + 273.15
    da_temperature.attrs["units"] = "K"
else:
    da_temperature = ds[temperature]

See for example b49fc40#diff-b6a50438da9d9fc4f5de22ebda3e229fL31

If you're all happy with this pull-request I'd like to merge it in.

@leifdenby leifdenby marked this pull request as ready for review August 18, 2020 13:16
@annalea-albright
Copy link
Collaborator

Much much better! Thanks a lot :)
And sorry I’ve gotten delayed on adding functions, but it’s on my to-do list!

@leosaffin
Copy link
Collaborator

I think it would be useful to have a separate pull request for the units feature (and the integration) because I can see a few things I would change for that. I think I would be happy merging the pull request following the commit 4060ef6 assuming it passes the tests.

matching_dataarrays = {}
vars_and_coords = list(ds.data_vars) + list(ds.coords)
for v in vars_and_coords:
print(v)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want the print statement

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks :)

@leifdenby
Copy link
Collaborator Author

I think it would be useful to have a separate pull request for the units feature (and the integration) because I can see a few things I would change for that. I think I would be happy merging the pull request following the commit 4060ef6 assuming it passes the tests.

How would you feel about merging this pull-request and starting a new issue to fix the bits you feel need improving? That would just save a me a few hours work undoing the changes I did to the tests and functions to make this work 😄

return da

if not HAS_UDUNITS2:
raise Exception(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better to have a more specific exception here

@leosaffin
Copy link
Collaborator

leosaffin commented Aug 19, 2020

Fair enough. I've approved the changes (although it doesn't look like it made a difference). The changes I would make are

  1. I would put the unit conversion in a separate function and module because it doesn't cover nomenclature
  2. When using get_field in other functions it would be better to use it from eurec4a_environment rather than eurec4a_environment.nomenclature
  3. Considering the first two points, it would be good to change the name of the function in nomenclature (e.g. to get_field_by_name) and have the function get_field in the top level module call that function and the function for unit conversion. Then other functions are calling this top level get_field and you can add other functionality that might not be related to nomenclature to it.

I can make these changes once your pull request is merged if they sound OK to you. Other questions I had:

  1. Did you find any important differences between cf_units and cfunits? As far as I could tell they were basically the same but cf_units was more active and better documented.

  2. Was using iris considered at all? This already does the unit checking as well as coordinate checking and better support for CF standard names (although the CF standard naming could be a limitation).

@leifdenby leifdenby merged commit c789425 into eurec4a:master Aug 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants