-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Silently loads incorrect data when variables only differ by cell_methods #252
Comments
ooh, that's a nasty one - thanks for spotting it! |
should |
|
It might also be wise to assert that all attrs are the same, so it will fail noisily. |
Yes to all of these things, but currently we haven't added attributes to the querying AFAIK. I too was alarmed that it failed to fail |
I didn't spot it, Wilma did. |
On a quick glance it looks like Not every variable has a |
I think I was running to something like that... I don't remember exactly now but somehow I think #212 was related. Just bringing it up in case it helps... if not ignore. |
Or was it @dhruvbhagtani that was running on something like that... Yes, that's right! @dhruvbhagtani, I remember you were loading the velocity fields and some months they were all positive because the cookbook was grabbing the |
This one's also relevant COSIMA/access-om2#234 |
Thanks @aekiss, that's what I had in mind! This was what @dhruvbhagtani was banging his head with for some days. |
Sorry, I knew we hadn't actually implemented the querying part of it, but I wasn't aware it was causing so many headaches with silent loading of incorrect data. |
Querying with
fixes the problem, e.g.: import cosima_cookbook as cc
import xarray as xr
import matplotlib.colors as colors
session = cc.database.create_session()
expt = '01deg_jra55v140_iaf_cycle2'
lat_slice = slice(-80, -59)
start_time = '1990-01-01'
end_time = '1990-12-31'
u_ds = cc.querying.getvar(expt,
'u',
session,
start_time=start_time,
end_time=end_time,
frequency='1 monthly',
attrs={'cell_methods': 'time: mean'})
u = u_ds.sel(yu_ocean=lat_slice, time=slice(start_time,end_time)).sel(st_ocean=200, method='nearest')
u.plot(col='time', col_wrap=4, vmin=-0.2, vmax=0.2, cmap='RdBu_r') |
Can we make the Exporer suggesting the user to use |
Yes, I faced this issue sometime back. I then explicitly added the |
Adding For example, this takes an enormously long time (I guess it's looking through lots of files) ds = cc.querying.getvar(expt, 'dxu', session, n=1,
attrs={'cell_methods': 'time: mean'} ) and then fails with
but if I remove So the solution is not as simple as having Instead I think the internal logic in if ("cell_methods" in v.ncvar_attrs) and not("cell_methods" in attrs):
q = q.filter(v.ncvar_attrs.any(name="cell_methods",
value="time: mean")) (I expect @aidanheerdegen, @angus-g what do you think? |
Also see discussion here: #214 (comment) |
@aekiss, if this was a respond to my message, just to clarify that I wasn't suggesting Anything that would somehow inform users so that they are aware that they won't be getting what they ask for! If we can make sure the users know then this is good enough! Then the users could figure out ways around the issue. Solutions for how the loading could be automatically achieved via the cookbook could come at a later point. (btw I'm cc-ing @wghuneke on this issue) |
Ouch! I wonder how many existing analyses this may have affected that we didn't know about. A quick check of which IAF diagnostics have multiple cell_methods: |
Yeap, wouldn't be a bad idea. Hopefully no papers need retraction. |
@navidcy, I wasn't responding to your post, but to my own suggestion here. Warnings etc in the explorer would be helpful but I think it would be safer to have defaults and/or warnings/failsafe in |
Warnings in both! Definitely. |
another link to a related discussion: #137 |
Hi @aidanheerdegen, this doesn't seem completely fixed. When I do this with analysis3-21.07 import cosima_cookbook as cc
session = cc.database.create_session()
info = cc.querying._ncfiles_for_variable('01deg_jra55v140_iaf', 'sea_level', session, frequency='1 monthly')
files = [str(f.NCFile.ncfile_path) for f in info] I get a mix of time-means and snapshots, and no warnings:
and I need to use info = cc.querying._ncfiles_for_variable('01deg_jra55v140_iaf', 'sea_level', session, frequency='1 monthly', attrs={'cell_methods': 'time: mean'}) to get only the time-means. Should this if attrs_unique is None:
attrs_unique = {} be replaced with if attrs_unique is None:
attrs_unique = {"cell_methods": "time: mean"} as in |
Hi Andrew. It's an internal routine (begins with Am i correct in assuming Is there a reason you need to use |
I was just using it to verify the files that are opened, following your code from the first post of this issue. How would I check |
You've verified that it does the right thing because when you pass the default argument it returns the correct file list. You can call In [1]: import cosima_cookbook as cc
...: session = cc.database.create_session()
In [2]: cc.querying.getvar('01deg_jra55v140_iaf', 'sea_level', session, frequency='1 monthly')
Out[2]:
<xarray.DataArray 'sea_level' (time: 732, yt_ocean: 2700, xt_ocean: 3600)>
dask.array<concatenate, shape=(732, 2700, 3600), dtype=float32, chunksize=(1, 540, 720), chunktype=numpy.ndarray>
Coordinates:
* xt_ocean (xt_ocean) float64 -279.9 -279.8 -279.7 ... 79.75 79.85 79.95
* yt_ocean (yt_ocean) float64 -81.11 -81.07 -81.02 ... 89.89 89.94 89.98
* time (time) datetime64[ns] 1958-01-16T12:00:00 ... 2018-12-16T12:00:00
Attributes:
long_name: effective sea level (eta_t + patm/(rho0*g)) on T cells
units: meter
valid_range: [-1000. 1000.]
cell_methods: time: mean
time_avg_info: average_T1,average_T2,average_DT
coordinates: geolon_t geolat_t
standard_name: sea_surface_height_above_geoid
time_bounds: <xarray.DataArray 'time_bounds' (time: 732, nv: 2)>\ndask...
In [3]: but I take your point that it is difficult to determine that this has done the right thing without checking out the fields. Equally the test cases do check that this throws an error as you would expect, so sometimes you either have to dig right in to satisfy yourself that it is correct, or just trust that it is working as expected. I have thought for a while that it would be good to return a list of files as an attribute to make it clear where the data came from. |
Yeah good point re. verification. |
When there are two variables with the same name and frequency but different
cell_methods
attribute, e.g. different averaging method, thegetvar
function will silently load quasi-random sections from the two different variables.As an example:
creates the following facet plot.
Clearly April to June have much lower velocities at 200m depth. This is because this data is from the same variable but with a different averaging (
pow02
) is being picked up for those 3 months. You can see this in the list of files returnedreturns
The immediate solution is to add an
ncfile
option to thegetvar
query:then the code above generates a correct plot:
This is a serious bug as it is silently ignored. So users can end up loading incorrect data without any error or warning from the Cookbook.
The text was updated successfully, but these errors were encountered: