Regridded tos fields are different with Dask<=2024.8.2 and Dask>=2024.9.0 #2607
Comments
Thanks @schlunma |
@schlunma is there any variability? ie if you run the same thing 2-3 times, do you get the same results? Can you do a diff between the data between the two outputs pls, ie |
No, no variability. It's exactly reproducible (I ran the recipe at least 20 times). I did a Will try to reproduce this without using ESMValCore now. |
Good pointers Manu! Ok so next step is trying to isolate the source - will help too, later today |
This comment may be helpful in finding the source of the problem: dask/dask#11296 (comment) |
Could be the masking at play, my original issue at iris was SciTools/iris#6109 (maybe useful for the MRE). But diffs of 80% (ie not of billions of billions percent) suggest to me that there is a different problem - not fillvalues becoming actual data points. Sounds like the compiler that was used for Dask or any other numerical library we or another of our deps uses. Manu, have you checked that, even if versions are identical, builds like nompi or mpich are also identical? |
I am using
  Package        Version   Build         Channel      Size
─────────────────────────────────────────────────────────────────
  Upgrade:
─────────────────────────────────────────────────────────────────
  - dask         2024.8.2  pyhd8ed1ab_0  conda-forge  Cached
  + dask         2024.9.0  pyhd8ed1ab_0  conda-forge  Cached
  - dask-core    2024.8.2  pyhd8ed1ab_0  conda-forge  Cached
  + dask-core    2024.9.0  pyhd8ed1ab_0  conda-forge  Cached
  - dask-expr    1.1.13    pyhd8ed1ab_0  conda-forge  Cached
  + dask-expr    1.1.14    pyhd8ed1ab_0  conda-forge  Cached
  - distributed  2024.8.2  pyhd8ed1ab_0  conda-forge  Cached
  + distributed  2024.9.0  pyhd8ed1ab_0  conda-forge  Cached
  Summary:
  Upgrade: 4 packages

So Dask is really the only package that's different between the two versions. The issue gets even weirder: the bug does not happen with other time ranges. Currently, the recipe uses 1980-2004, but if all years are used the data looks fine. I also tried to isolate this by setting up a simple Python script, but couldn't reproduce this without using our |
All right, here's a little script to reproduce this without using ESMValCore. The problem appears only if the start year is 1980; with 1979, everything is fine.

import iris
import numpy as np
from iris.coords import DimCoord
from iris.cube import Cube
from esmf_regrid.schemes import ESMFBilinear
from iris.time import PartialDateTime

path = "/work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ/CMIP/NCAR/CESM2/historical/r1i1p1f1/Omon/tos/gn/v20190308/tos_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc"


def extract_time(cube):
    # start_datetime = PartialDateTime(1979, 1, 1)  # no problem
    start_datetime = PartialDateTime(1980, 1, 1)  # problem
    end_datetime = PartialDateTime(2004, 12, 31)
    time_coord = cube.coord('time')
    dates = time_coord.units.num2date(time_coord.points)
    select = (dates >= start_datetime) & (dates < end_datetime)
    # Note: if dates are selected via ints, e.g., cube[1560:1860, ...] or an iris.Constraint, everything is fine as well.
    # It appears that we need boolean indexing for the bug to appear...
    return cube[select, ...]


def regrid(cube):
    lat = DimCoord(np.linspace(-89, 89, 90), standard_name="latitude", units="degrees")
    lon = DimCoord(np.linspace(1, 359, 180), standard_name="longitude", units="degrees")
    target_cube = Cube(np.ones((90, 180)), dim_coords_and_dims=[(lat, 0), (lon, 1)])
    return cube.regrid(target_cube, ESMFBilinear())


cube = iris.load_cube(path)
cube = extract_time(cube)
cube = regrid(cube)
path_regridded_cube = "bug.nc"
iris.save(cube, path_regridded_cube)
print("Saved", path_regridded_cube)

Comparison of output data:

$ # Starting date 1979
$ cdo diffn 1979_dask-2024-8-2.nc 1979_dask-2024-9-0.nc
cdo diffn: Processed 10108800 values from 2 variables over 624 timesteps [0.10s 42MB]

$ # Starting date 1980
$ cdo diffn 1980_dask-2024-8-2.nc 1980_dask-2024-9-0.nc
Date Time Level Gridsize Miss Diff : S Z Max_Absdiff Max_Reldiff : Parameter name
2 : 1980-02-14 00:00:00 0 16200 0 10251 : T F 8.5582 0.99935 : tos
3 : 1980-03-15 12:00:00 0 16200 0 10251 : T F 13.328 0.99821 : tos
4 : 1980-04-15 00:00:00 0 16200 0 10251 : T F 5.2479 0.99882 : tos
5 : 1980-05-15 12:00:00 0 16200 0 10251 : T F 6.7898 0.99807 : tos
6 : 1980-06-15 00:00:00 0 16200 0 10251 : T F 8.7143 0.99734 : tos
7 : 1980-07-15 12:00:00 0 16200 0 10251 : T F 13.717 0.99826 : tos
8 : 1980-08-15 12:00:00 0 16200 0 10251 : T F 3.6858 0.99967 : tos
9 : 1980-09-15 00:00:00 0 16200 0 10251 : T F 10.937 0.99952 : tos
...
299 of 300 records differ
cdo diffn: Processed 9720000 values from 2 variables over 600 timesteps [0.08s 41MB]

I am utterly confused 🤯 I will work on something different for now, need a break from this... |
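For readers without cdo at hand, a per-time-step comparison similar in spirit to the `cdo diffn` output above can be sketched with NumPy. This is only an illustration on synthetic arrays: the function name `diff_per_timestep` and the toy data are assumptions, not part of the original script, and real files would first need to be loaded into `(time, lat, lon)` arrays.

```python
import numpy as np


def diff_per_timestep(a, b):
    """Count differing points and the max absolute difference per time step.

    Both inputs are assumed to be plain (time, lat, lon) arrays of equal shape.
    """
    if a.shape != b.shape:
        raise ValueError("arrays must have the same shape")
    n_time = a.shape[0]
    n_diff = (a != b).reshape(n_time, -1).sum(axis=1)
    max_absdiff = np.abs(a - b).reshape(n_time, -1).max(axis=1)
    return n_diff, max_absdiff


# Synthetic stand-in for the two regridded outputs (hypothetical data).
rng = np.random.default_rng(0)
old = rng.normal(size=(4, 3, 5))
new = old.copy()
new[[1, 2]] = new[[2, 1]]  # swap two time steps to mimic misplaced slices

n_diff, max_absdiff = diff_per_timestep(old, new)
print(n_diff)
print(max_absdiff)
```

Time steps that were left untouched report zero differing points, while the swapped ones differ everywhere.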
hey Manu, my turn to look into this, have just started testing with your scripty-script above, and the offending file 🍺 |
Just to throw another problem in here: I was running the #2178 recipe with my new environment with dask version 2024.11.2, and for the hourly data I ran into a divide-by-zero error coming from dask in the climate statistics preprocessor. After seeing this issue, I downgraded dask to 2024.8.2 and it ran fine. So there are probably several things that are impacted by recent dask changes. I assume these are different dask issues, as I get errors while Manuel has the more critical case here: code that runs but produces faulty output. |
OK so using Manu's script above, and Dask 2024.8.0 or 2024.9.0, I'm getting these values for the mean of so clearly an issue with masked values to start with, then, a mean on data that is below, say, 1000: Dask 2024-8 mean data 14.1161852220453 so this suggests to me at least the mean is almost identical; I went ahead and did the full sets of stats: Dask 2024-8 mean data 14.1161852220453 This suggests to me these datasets are identical apart from the fillvalue that is counted as an actual data point in Dask 2024.8 - and that's because you slice through, and that's the issue I raised with Iris back in August |
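The effect described above, a fill value being counted as an actual data point, can be illustrated with a tiny NumPy example. This is a sketch on made-up numbers: the fill value `1e36` and the threshold of 1000 are assumptions chosen for illustration (the threshold follows the "below, say, 1000" remark), not values read from the files.

```python
import numpy as np

FILL = 1e36  # hypothetical fill value, not taken from the actual dataset

# Three valid sea surface temperatures plus one unmasked fill value.
data = np.array([10.0, 14.0, 18.0, FILL])

# Mean with the fill value wrongly treated as data.
naive_mean = data.mean()

# Mean restricted to physically plausible values (below 1000).
masked = np.ma.masked_greater(data, 1000.0)
masked_mean = masked.mean()

print(naive_mean)
print(masked_mean)
```

A single leaked fill value dominates the naive mean, which is why a mean over the thresholded data is the more meaningful comparison between the two outputs.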
Did you create two different It appears to me you only created one and then evaluated this with two different Dask versions. Since dask/dask#11296, you'll get the different mean in 2024-8-2. |
yes Manu, of course I created two different files 😄 - one per each Dask version |
|
note that my Dask 8 was 2024.8.0 not 2024.8.2 - am just about to test with .2 now |
so with Dask 2024.8.2 the masking issue disappears, and am getting almost identical results: Dask 2024-8 mean data 14.11618522204531 Note that I am using esmvalcore from the latest development branch, not 2.11.1 as you did, so I am now going to test with v2.11.1 |
My environment is also derived from the latest main branch. Could you try |
|
Yes! This is what I got, too. Very interesting that all statistics you calculated seem to be the same, but the actual data is still different... |
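One way global statistics can agree while the data differ point by point is a reordering of slices along the time axis. The following sketch shows this on synthetic data; the array shape and the reordering are arbitrary assumptions, and it does not claim that this is exactly what Dask or Iris does internally.

```python
import numpy as np

rng = np.random.default_rng(42)
field = rng.normal(loc=14.0, scale=5.0, size=(6, 4, 8))  # (time, lat, lon)

# Reorder the time steps; every value is still present, just elsewhere.
order = np.array([3, 4, 5, 0, 1, 2])
shuffled = field[order]

# Mean and std agree up to floating-point summation order; min and max are exact.
same_mean = np.isclose(field.mean(), shuffled.mean())
same_std = np.isclose(field.std(), shuffled.std())
same_min = field.min() == shuffled.min()
same_max = field.max() == shuffled.max()
identical = np.array_equal(field, shuffled)

print(same_mean, same_std, same_min, same_max, identical)
```

The tiny differences in the last digits of the reported means are consistent with this picture, since summing the same numbers in a different order can change the final bits.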
shifted most probably, lemme try get some sense of how the data is arranged |
yep indeed, points differ: print(c1dat[100, 30, 50]) 19.318310830559554 but exactly 17.2... lives in the other file at index (array([56]), array([30]), array([50])), and conversely, 19.3... lives in the other file at index (array([144]), array([30]), array([50])) exactly 44 time points shifted from 100 like the other one, but +44 not -44 |
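The manual lookup above (finding at which time index a given value lives in the other file) can be generalised to whole time slices. The helper below is a sketch with a hypothetical name, `match_time_steps`, run on synthetic data; it relies on exact equality, so it assumes plain arrays without NaNs or masked points.

```python
import numpy as np


def match_time_steps(a, b):
    """For each time index i of a, return the index j with b[j] == a[i], or -1."""
    flat_a = a.reshape(a.shape[0], -1)
    flat_b = b.reshape(b.shape[0], -1)
    mapping = np.full(a.shape[0], -1)
    for i, row in enumerate(flat_a):
        # Time steps of b whose whole horizontal field equals this slice of a.
        hits = np.flatnonzero((flat_b == row).all(axis=1))
        if hits.size:
            mapping[i] = hits[0]
    return mapping


rng = np.random.default_rng(1)
a = rng.normal(size=(5, 2, 3))
order = np.array([2, 0, 1, 4, 3])
b = a[order]  # b[j] holds the slice a[order[j]]

mapping = match_time_steps(a, b)
print(mapping)  # mapping[i] - i is the shift of time step i
```

Printing `mapping - np.arange(len(mapping))` then gives the per-time-step shift directly, which is how offsets like +44 and -44 would show up.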
OK finally finally figured this shtuff out! 🍺 Stay put for shtuff |
Considering SciTools/iris#6251 (comment), would adding a pin on our dask version make sense? |
no, hang on, I wanted to reply to the iris issue, but then I forgot - lemme go there, thanks for the reminder, bud 🍺 |
So I'd really go with not using bool slices anymore - deffo a bug in Iris, they are not yet ready to fix it, pinning Dask at such an old version will only give us more headaches |
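A minimal sketch of what avoiding boolean slices could look like: turn the boolean selection into an integer slice whenever the selected indices are contiguous, which is the case for a simple start/end time range. The helper name `mask_to_slice` is hypothetical, and the example uses a plain NumPy array rather than an Iris cube. The script above only reports that integer slicing such as `cube[1560:1860, ...]` worked fine, so this is not a confirmed fix.

```python
import numpy as np


def mask_to_slice(select):
    """Convert a boolean mask with one contiguous True block into a slice."""
    idx = np.flatnonzero(select)
    if idx.size == 0:
        return slice(0, 0)
    if idx[-1] - idx[0] + 1 != idx.size:
        raise ValueError("selection is not contiguous; cannot use a slice")
    return slice(int(idx[0]), int(idx[-1]) + 1)


select = np.array([False, False, True, True, True, False])
time_slice = mask_to_slice(select)

data = np.arange(6)
subset = data[time_slice]  # same elements as data[select], via basic slicing
print(time_slice, subset)
```

For non-contiguous selections the helper raises instead of silently falling back to boolean indexing, so the caller has to decide how to handle that case.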
As I already mentioned in the PR description, I think pinning Dask to 2024.8.2 as a temporary solution would probably make sense. With the latest release of 2024.12.1 yesterday, we even got more problems with Dask: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore/12287/workflows/59e8e9e5-a46a-46a7-8cee-7e5d3958bfe9/jobs/51396... |
we don't see those barfs in the GA tests though - there could be interplay with other dev packages https://github.com/ESMValGroup/ESMValCore/actions/runs/12384390417/job/34568856143 OK, iris have decided to pin Dask anyway, so there's not much room for wiggle for us here - hopefully we're not getting hit badly with such an old Dask, plus, I got about one hour left for work this year, so am not gonna split the Red Sea in this last hour of work 😁 🎄 |
Yeah, hopefully we can sort this out very soon. Looks like another big problem with Dask (distributed) will be solved in one of the upcoming releases. |
Hi, I have a fun one.
While testing #2517 using recipe_schlund20esd.yml, I found something very weird. For some models (not all!), one specific plot looks vastly different:

I could trace this back to the Dask version. Both plots use the current main branch of ESMValCore and ESMValTool, and all dependencies except for Dask are identical. The issue started appearing in Dask 2024.9.0 and also appears in the current version (2024.12.0, released yesterday). The latest version that looks ok is 2024.8.2.
This plot is derived from three variables. Only one preprocessed dataset is different in those two versions: tos, which is preprocessed like this:

Here is one time slice (March 1980) of the original data and the preprocessed data with the two different Dask versions:
At first glance, both regridded datasets appear fine. However, if you look closely, you can see huge differences (e.g., 6°C in the Mediterranean Sea) between the two, but the one in the earlier Dask version clearly is more similar to the input data (which is what we want). Thus, the old version is clearly the correct one here.
I am honestly shocked that something like this could happen. The new field looks okayish and reasonable on its own, so without proper analysis and comparison to the original data you probably would think it's okay. This is super dangerous and makes this recipe not reproducible...
I have no idea what's going on here. Since all other dependencies are identical (e.g. numpy 1.26.4, iris 3.11, iris-esmf-regrid 0.11.0, netcdf4 1.7.2), this must be an issue with Dask. However, since the corresponding code is buried under many layers (iris-esmf-regrid, iris), it's very hard to find the actual problem.
@ESMValGroup/esmvaltool-coreteam any ideas? Should we pin Dask for now as an immediate fix and then try to dig deeper to find the underlying problem?
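Until a pin is in place, a simple guard on the installed version is one way to flag potentially affected environments. The sketch below works on a version string only; the names `parse_version`, `LAST_GOOD` and `dask_is_suspect` are hypothetical, and the threshold just encodes the observation above that 2024.8.2 is the latest version that looks ok. It assumes plain `YYYY.M.P` version strings without pre-release suffixes.

```python
LAST_GOOD = (2024, 8, 2)  # latest Dask version reported as fine in this issue


def parse_version(text):
    """Parse a 'YYYY.M.P' version string into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def dask_is_suspect(version_text):
    """Return True if the version is newer than the last known-good release."""
    return parse_version(version_text) > LAST_GOOD


for candidate in ("2024.8.2", "2024.9.0", "2024.12.0"):
    print(candidate, dask_is_suspect(candidate))
```

In a real environment the version string could come from `dask.__version__`.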